Exploring Speech-Enabled Dialogue with the Galaxy
Communicator Infrastructure


Samuel Bayer
The MITRE Corporation
202 Burlington Rd.
Bedford, MA 01730

sam@mitre.org


Christine Doran
The MITRE Corporation
202 Burlington Rd.
Bedford, MA 01730

cdoran@mitre.org


Bryan George
The MITRE Corporation
11493 Sunset Hills Rd.
Reston, VA 20190

bgeorge@mitre.org


ABSTRACT 

This demonstration will motivate some of the significant
properties of the Galaxy Communicator Software Infrastructure
and show how they support the goals of the DARPA
Communicator program.

Keywords 

Spoken dialogue, speech interfaces

1.  INTRODUCTION 

The DARPA Communicator program [1], now in its second
fiscal year, is intended to push the boundaries of
speech-enabled dialogue systems by enabling a freer interchange
between human and machine. A crucial enabling technology
for the DARPA Communicator program is the Galaxy
Communicator software infrastructure (GCSI), which provides
a common software platform for dialogue system development.
This infrastructure was initially designed and constructed by
MIT [2], and is now maintained and enhanced by the MITRE
Corporation. This demonstration will motivate some of the
significant properties of this infrastructure and show how they
support the goals of the DARPA Communicator program.

2.  HIGHLIGHTED  PROPERTIES 

The GCSI is a distributed hub-and-spoke infrastructure which
allows the programmer to develop Communicator-compliant
servers in C, C++, Java, Python, or Allegro Common Lisp. This
system is based on message passing rather than CORBA- or
RPC-style APIs. The hub in this infrastructure supports
routing of messages consisting of key-value pairs, but also
supports logging and rule-based scripting. Such an
infrastructure has the following desirable properties:

• The scripting capabilities of the hub allow the
programmer to weave together servers which may not
otherwise have been intended to work together, by
rerouting messages and their responses and transforming
their keys.

• The scripting capabilities of the hub allow the
programmer to insert simple tools and filters to convert
data among formats.

• The scripting capabilities of the hub make it easy to
modify the message flow of control in real time.

• The scripting capabilities of the hub and the simplicity of
message passing make it simple to build up systems bit
by bit.

• The standard infrastructure allows the Communicator
program to develop platform- and
programming-language-independent service standards for
recognition, synthesis, and other better-understood resources.

• The standard infrastructure allows members of the
Communicator program to contribute generally useful
tools to other program participants.

This demonstration will illustrate a number of these
properties.
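The routing behavior described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GCSI API or the hub scripting language: the class `ToyHub`, its methods, the two toy servers, and the keys `audio`, `hypothesis`, `text`, and `spoken` are all hypothetical names chosen for this example. It shows messages as key-value pairs, rules that decide which server sees a message, a key rename that lets two servers with different key conventions cooperate, and a simple log.

```python
from typing import Callable, Dict, List, Optional, Tuple

Message = Dict[str, str]                 # a message is a set of key-value pairs
Handler = Callable[[Message], Message]   # a server maps a message to a reply
Condition = Callable[[Message], bool]


class ToyHub:
    """Hypothetical stand-in for a hub: rule-based routing plus logging."""

    def __init__(self) -> None:
        self.servers: Dict[str, Handler] = {}
        # Each rule: (condition, server name, key renames applied before dispatch)
        self.rules: List[Tuple[Condition, str, Dict[str, str]]] = []
        self.log: List[Tuple[str, Message]] = []

    def register(self, name: str, handler: Handler) -> None:
        self.servers[name] = handler

    def add_rule(self, condition: Condition, server: str,
                 rename: Optional[Dict[str, str]] = None) -> None:
        self.rules.append((condition, server, rename or {}))

    def dispatch(self, message: Message) -> Message:
        """Apply rules in order; merge each matching server's reply into the state."""
        state = dict(message)
        for condition, server, rename in self.rules:
            if not condition(state):
                continue
            # Transform keys so the receiving server sees the names it expects.
            outgoing = {rename.get(key, key): value for key, value in state.items()}
            self.log.append((server, outgoing))
            state.update(self.servers[server](outgoing))
        return state


def toy_recognizer(msg: Message) -> Message:
    # Assumption: the "audio" value is already a transcript.
    return {"hypothesis": msg["audio"]}


def toy_synthesizer(msg: Message) -> Message:
    # Assumption: this server expects its input under the key "text".
    return {"spoken": "[speaking] " + msg["text"]}


hub = ToyHub()
hub.register("recognizer", toy_recognizer)
hub.register("synthesizer", toy_synthesizer)
hub.add_rule(lambda m: "audio" in m and "hypothesis" not in m, "recognizer")
hub.add_rule(lambda m: "hypothesis" in m, "synthesizer",
             rename={"hypothesis": "text"})

result = hub.dispatch({"audio": "i'd like to fly to tacoma"})
print(result["spoken"])
```

Neither toy server knows about the other; the rename in the second rule is the only place where their differing key names are reconciled.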

3. DEMO CONFIGURATION AND CONTENT

By way of illustration, this demo will simulate a process of
assembling a Communicator-compliant system, while at the
same time exemplifying some of the more powerful aspects of
the infrastructure. The demonstration has three phases,
representing three successively more complex configuration
steps. We use a graphical display of the Communicator hub to
make it easy to see the behavior of this system.

As you can see in Figure 1, the hub is connected to eight
servers:

• MITRE's Java Desktop Audio Server (JDAS)

• MIT SUMMIT recognizer, using MIT's Mercury travel
domain language model

• CMU Sphinx recognizer, with a Communicator-compliant
wrapper written by the University of Colorado Center for
Spoken Language Research (CSLR), using CSLR's travel
domain language model

• A string conversion server, for managing
incompatibilities between recognizer output and
synthesizer input

• CSLR's concatenative Phrase TTS synthesizer, using their
travel domain voice


• CMU/Edinburgh Festival synthesizer, with a
Communicator-compliant wrapper written by CSLR, using
CMU's travel domain language model for Festival's
concatenative voice

• MIT TINA parser, using MIT's Mercury travel domain
language model

• MIT Genesis paraphraser, using MIT's Mercury travel
domain language model

We will use the flexibility of the GCSI, and the hub scripting
language in particular, to change the path that messages follow
among these servers.


3.1  Phase  1 

In phase 1, we establish audio connectivity. JDAS is MITRE's
contribution to the problem of reliable access to audio
resources. It is based on JavaSound 1.0 (distributed with JDK
1.3), and supports barge-in. We show the capabilities of JDAS
by having the system echo the speaker's input; we also
demonstrate the barge-in capabilities of JDAS by showing
that the speaker can interrupt the playback with a new
utterance. The goal in building JDAS is that anyone who
has a desktop microphone and the Communicator
infrastructure will be able to use this audio server to establish
connectivity with any Communicator-compliant recognizer or
synthesizer.

3.2  Changing  the  message  path 

The hub maintains a number of information states. The
Communicator hub script which the developer writes can both
access and update these information states, and we can invoke
"programs" in the Communicator hub script by sending
messages to the hub. This demonstration exploits this
capability by using messages sent from the graphical display
to change the path that messages follow, as illustrated in
Figure 2. In phase 1, the hub script routed messages from JDAS
back to JDAS (enabled by the message named "Echo"). In the
next phase, we will change the path of messages from JDAS
and send them to a speech recognizer.
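The idea of changing the message path by updating hub state can be sketched as follows. This is a toy model in Python, not the hub scripting language: `ScriptedHub`, the `set_program` control key, the program name "Recognize/Synthesize", and the three lambda servers are assumptions made for illustration. Only the program name "Echo" comes from the text above.

```python
from typing import Callable, Dict, List

Server = Callable[[str], str]

# Toy servers; each wraps its input so the path taken is visible in the result.
SERVERS: Dict[str, Server] = {
    "jdas": lambda data: f"play({data})",
    "recognizer": lambda data: f"text({data})",
    "synthesizer": lambda data: f"speech({data})",
}

# Named "programs": the ordered list of servers a message from JDAS visits.
# "Recognize/Synthesize" is a hypothetical name for the phase 2 path.
PROGRAMS: Dict[str, List[str]] = {
    "Echo": ["jdas"],
    "Recognize/Synthesize": ["recognizer", "synthesizer", "jdas"],
}


class ScriptedHub:
    """Hypothetical hub whose routing depends on an updatable information state."""

    def __init__(self) -> None:
        self.state: Dict[str, str] = {"program": "Echo"}

    def handle(self, message: Dict[str, str]) -> str:
        # A control message updates the hub state instead of being routed.
        if "set_program" in message:
            name = message["set_program"]
            if name not in PROGRAMS:
                raise KeyError(f"unknown program: {name}")
            self.state["program"] = name
            return name
        # An ordinary message follows whatever path the current state selects.
        data = message["audio"]
        for server in PROGRAMS[self.state["program"]]:
            data = SERVERS[server](data)
        return data


hub = ScriptedHub()
first = hub.handle({"audio": "utt1"})                 # echoed back to JDAS
hub.handle({"set_program": "Recognize/Synthesize"})   # sent by the display
second = hub.handle({"audio": "utt2"})                # now passes through recognition
print(first, second)
```

The servers themselves are untouched by the switch; only the hub's state changes, which is what lets the path be altered while the system is running.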



Figure  2:  Modifying  the  hub  information  state 


3.3  Phase  2 

Now that we've established audio connectivity, we can add
recognition and synthesis. In this configuration, we will route
the output of the preferred recognizer to the preferred
synthesizer. When we change the path through the hub script
using the graphical display, the preferred servers are
highlighted. Figure 3 shows that the initial configuration of
phase 2 prefers SUMMIT and Festival.


Figure  3:  Initial  recognition/synthesis  configuration 

The SUMMIT recognizer and the Festival synthesizer were not
intended to work together; in fact, while there is a good deal of
activity in the area of establishing data standards for various
aspects of dialogue systems (cf. [3]), there are no
programming-language-independent service definitions for
speech. The hub scripting capability, however, allows these
tools to be incorporated into the same configuration and to
interact with each other. The remaining incompatibilities (for
instance, the differences in markup between the recognizer
output and the input the synthesizer expects) are addressed by
the string server, which can intervene between the recognizer
and synthesizer. The GCSI thus makes it easy both to connect a
variety of tools to the hub and make them interoperate, and
to insert simple filters and processors to facilitate the
interoperation.
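A filter of the kind the string server provides might look like the sketch below. The actual markup differences between SUMMIT output and Festival input are not described here, so the input format is an assumption: the example supposes a recognizer that emits uppercase tokens, angle-bracket tags such as `<noise>`, and underscores joining multiword names, and a synthesizer that wants plain lowercase text. The function name is hypothetical.

```python
import re


def recognizer_to_synthesizer(hypothesis: str) -> str:
    """Convert an assumed recognizer string format into plain text for synthesis."""
    text = re.sub(r"<[^>]*>", " ", hypothesis)   # drop tags such as <noise>
    text = text.replace("_", " ")                # split joined tokens like NEW_YORK
    text = re.sub(r"\s+", " ", text).strip()     # collapse runs of whitespace
    return text.lower()


converted = recognizer_to_synthesizer("<noise> I WANT TO FLY TO NEW_YORK <pause>")
print(converted)
```

Because the conversion lives in its own server, neither the recognizer nor the synthesizer needs to be modified to accommodate the other.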

In addition to being able to send general messages to the hub,
the user can use the graphical display to send messages
associated with particular servers. So we can change the
preferred recognizer or synthesizer (as shown in Figure 4), or
change the Festival voice (as shown in Figure 5). All these
messages are configurable from the hub script.



Figure  4:  Preferring  a  recognizer 

Figure  5:  Changing  the  Festival  voice 

3.4  Phase  3 

Now that we've established connectivity with recognition and
synthesis, we can add parsing and generation (or, in this case,
input paraphrase). Figure 6 illustrates the final configuration,
after changing recognizer and synthesizer preferences. In this
phase, the output of the recognizer is routed to the parser,
which produces a structure which is then paraphrased and then
sent to the synthesizer. So for instance, the user might say "I'd
like to fly to Tacoma", and after parsing and paraphrase, the
output from the synthesizer might be "A trip to Tacoma".
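The parse-then-paraphrase step can be illustrated with a toy example in Python. It makes no claim about how TINA or Genesis work internally: the regular expression, the frame keys `clause` and `destination`, and both function names are assumptions, and the sketch handles only utterances shaped like the example sentence above.

```python
import re
from typing import Dict, Optional


def toy_parse(utterance: str) -> Optional[Dict[str, str]]:
    """Extract a destination into a flat frame; return None if nothing matches."""
    match = re.search(r"\bto fly to (?P<city>[A-Za-z ]+)$", utterance.strip())
    if match is None:
        return None
    return {"clause": "trip", "destination": match.group("city").strip()}


def toy_paraphrase(frame: Dict[str, str]) -> str:
    """Render the frame back into a short phrase for the synthesizer."""
    return f"A trip to {frame['destination']}"


frame = toy_parse("I'd like to fly to Tacoma")
if frame is None:
    raise ValueError("utterance not covered by the toy grammar")
paraphrase = toy_paraphrase(frame)
print(paraphrase)
```

The point of the intermediate frame is that the synthesizer speaks what the system understood rather than what the recognizer heard, which makes misparses audible during debugging.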


4.  CONCLUSION 

The configuration at the end of phase 3 is obviously not a
complete dialogue system; this configuration is missing
context management and dialogue control, as well as an
application backend, as illustrated by the remaining
components in white in Figure 7. However, the purpose of the
demonstration is to illustrate the ease of plug-and-play
experiments within the GCSI, and the role these capabilities
play in assembling and debugging a complex Communicator
interface. The GCSI is available under an open source license at
http://fofoca.mitre.org/download.



Figure  7:  A  sample  full  dialogue  system  configuration 


5.  ACKNOWLEDGMENTS 

This work was funded by the DARPA Communicator program
under contract number DAAB07-99-C201. © 2001 The MITRE
Corporation. All rights reserved.


6.  REFERENCES 

[1] http://www.darpa.mil/ito/research/com/index.html.

[2] S. Seneff, E. Hurley, R. Lau, C. Pao, P. Schmid, and
V. Zue. Galaxy-II: A Reference Architecture for
Conversational System Development. Proc. ICSLP
98, Sydney, Australia, November 1998.

[3] "'Voice Browser' Activity." http://www.w3.org/Voice.


